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Abstract 

Background: Global network alignment has been proposed as an effective tool for computing functional orthology. 
Commonly used global alignment techniques such as IsoRank rely on a two-step process: the first step is an iterative 
diffusion-based approach for assigning similarity scores to all possible node pairs (matchings); the second step applies 
a maximum-weight bipartite matching algorithm to this similarity score matrix to identify orthologous node pairs. 
While demonstrably successful in identifying orthologies beyond those based on sequences, this two-step process is 
computationally expensive. Recent work on computation of node-pair similarity matrices has demonstrated that the 
computational cost of the first step can be significantly reduced. The use of these accelerated methods renders the 
bipartite matching step as the dominant computational cost. This motivates a critical assessment of the tradeoffs of 
computational cost and solution quality (matching quality, topological matches, and biological significance) 
associated with the bipartite matching step. In this paper we utilize the state-of-the-art core diffusion-based step in 
IsoRank for similarity matrix computation, and couple it with two heuristic bipartite matching algorithms - a 
matrix-based greedy approach, and a tunable, adaptive, auction-based matching algorithm developed by us. We then 
compare our implementations against the performance and quality characteristics of the solution produced by the 
reference IsoRank binary, which also implements an optimal matching algorithm. 

Results: Using heuristic matching algorithms in the IsoRank pipeline exhibits dramatic speedup improvements; 
typically x30 times faster for the total alignment process in most cases of interest. More surprisingly, these 
improvements in compute times are typically accompanied by better or comparable topological and biological quality 
for the network alignments generated. These measures are quantified by the number of conserved edges in the 
alignment graph, the percentage of enriched components, and the total number of covered Gene Ontology (GO) 
terms. 

Conclusions: We have demonstrated significant reductions in global network alignment computation times by 
coupling heuristic bipartite matching methods with the similarity scoring step of the IsoRank procedure. Our heuristic 
matching techniques maintain comparable - if not better - quality in resulting alignments. A consequence of our 
work is that network-alignment based orthologies can be computed within minutes (as compared to hours) on typical 
protein interaction networks, enabling a more comprehensive tuning of alignment parameters for refined orthologies. 



Background 

The description of a cell as a collection of pathways 
of interacting biochemical components is fundamental 
to a systems view of biological processes. Data relating 
to regulatory, metabolic, and signaling interactions, is 



"Correspondence: gkollias@purdue.edu; madan.sathe@unibas.ch; 
mohammas@purdue.edu; ayg@cs.purdue.edu 

1 Department of Computer Science, Purdue University, 305 N. University 
Street, West Lafayette, IN 47907, USA 

2 Department of Mathematics and Computer Science, University of Basel, 
Klingelbergstrasse 50, 4056 Basel, Switzerland 

Full list of author information is available at the end of the article 



systematically, and naturally encoded into networks [1]. 
The diversity of species, cellular processes, abstractions, 
and volume of interaction data generated from high- 
throughput techniques strongly motivates development 
of effective and efficient analysis algorithms. Over the 
past two decades, significant progress has been made 
on algorithms for identifying conserved components, 
discriminating components, modularity, clustering, and 
alignment, of sequences, sets, and special graph structures 
(trees, DAGs). However, solving these problems for gen- 
eral large sparse graphs, while providing sound statistical 
basis for results, remains a topic of significant ongoing 
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investigation. Effective solutions to these problems must 
leverage properties of specific datasets to deliver desirable 
performance. 

In this paper, we focus on the problem of network 
alignment. This problem aims to quantify the similarity 
of two given graphs, resulting in the mapping of nodes 
from one graph to the other, along with the induced 
edge mapping. As a specific instance of this problem, 
in protein interaction (PPI) networks, physically interact- 
ing proteins are represented as edge-connected nodes. 
Identifying (topological) regions of similarity between 
networks of different species reveals insights into the 
functional organization and coherence of sub-networks. 
Specifically, if connected sub-graphs are conserved across 
species, they likely correspond to shared function across 
and within sub-graphs. This can be used for annotat- 
ing proteins (by mapping annotations across species), 
inferring missing interactions, and drawing functional 
orthologies. 

Network alignments can be derived from local or global 
measures of cost. In local network alignment (LNA) [2-4], 
local scoring functions reward potentially small subgraph 
matches. In PPIs, it follows that a node (protein) may 
potentially participate in many mappings. In contrast, 
global network alignment (GNA) [5-7] relies on a cost 
function denned over entire networks. This implies that 
in PPIs, a single protein from one network is mapped to a 
single protein in the other. 

Proposed approaches to global alignment typically pro- 
ceed in two steps: 

• In the first step, a similarity score matrix X is 
constructed, where element xtj denotes the similarity 
of node i in the first graph to node ; in the second 
graph. 

• In the second step, a node matching algorithm selects 
pairs of nodes, one from each network, optimizing an 
aggregate similarity score measure using matrix X 

The similarity of two nodes is determined by the sim- 
ilarity of their interaction profiles, also called topological 
similarity, and the inherent similarity of the nodes. The 
latter notion of inherent similarity is introduced in by 
Singh et al. [6,7], as elemental (or node) similarity. Ele- 
mental similarity scores supplement topological similar- 
ity, and rely on node labels or attributes. As an example, 
for protein pairs, sequence similarity scores, indepen- 
dently computed by BLAST, can be used for elemental 
similarity. The method of Singh et al. [6,7], called Iso- 
Rank, uses a notion of topological similarity based on 
an iterative diffusion process for computing matrix X 
In this method, the similarity of a pair of nodes is iter- 
atively determined by the similarity of their neighbors. 



The topological similarity thus computed is accumulated 
with the elemental similarity at each iterative step and the 
process is run to convergence to yield matrix similarity 
matrix X 

In the second step of the method, best-matching pairs of 
nodes are identified. The matrix X is viewed as a weighted 
bipartite graph G = (Va>Vb>E)> where node i of the 
first graph represents a vertex in Va> node j of the sec- 
ond graph illustrates a vertex in Vg and matrix entry xy 
represents a weighted edge between the two nodes. Bipar- 
tite graph matching algorithms can be applied to obtain a 
set of edges M, M c E, such that no pair of edges in M 
are incident on/to the same vertex. Furthermore, the sum 
over the weighted edges inM is maximized (IsoRank [6,7], 
NetAlignBP [8], H-GRAAL [9]). 

In IsoRank, the computation of matrix X represents 
the dominant cost. However, algorithmic improvements 
to this step have resulted in significant reductions in 
the cost of similarity matrix computation [10], particu- 
larly in cases with a small set of dominant, elemental 
similarity components. This has resulted in the second, 
bipartite matching step now becoming the performance 
bottleneck, and consequently, the focus of performance 
improvements. 

In this paper, we critically examine the need for opti- 
mal bipartite matching, suitable heuristic algorithms for 
bipartite matching, the associated improvements in over- 
all runtime, and their implications for quality of solution, 
both in terms of biological implications and topological 
matches. 

Specifically, for a series of PPI network pairs, we exper- 
iment with our customized implementation of the two 
stage pipeline: we utilize the IsoRank scoring matrix 
calculation as the first stage, followed by the applica- 
tion of the matrix-based, greedy, and adaptive, auction- 
based matching algorithms. These two methods gen- 
erate two alignments for each input network pair - 
(mat3_greedy and mat3_auction results). We then run, 
for the same input networks and parameters, the ref- 
erence native binary implementation of IsoRank made 
available by Singh et al.[6], who report that their software 
applies the greedy and Hungarian algorithms to the scor- 
ing matrix X, and produces iso_greedy and iso_hungarian 
alignments. 

Our key result is an improvement in overall runtime 
of over one order of magnitude - typically an acceler- 
ation of x30 for almost all tested PPI network pairs - 
with comparable, and in some cases superior results in 
terms of topological and biological quality (mat3_* ver- 
sus /so_* computations). This represents a reduction in 
runtime for typical cases of PPI networks from hours to 
minutes - thus enabling a more comprehensive explo- 
ration of the parameter space for extracting desirable 
orthologies. 
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Results and discussion 

We report on the computational cost and quality char- 
acteristics of our mat3_* alignments in comparison with 
the ones generated by the reference native binary imple- 
mentation of the IsoRank (iso_* results). The set of PPI 
networks, sequence similarity of their corresponding pro- 
teins, and the executable files for IsoRank [6,11] were 
obtained through the IsoRank public website [11]. PPI 
networks in this dataset, the list of which is provided in 
Table 1, were collected from publicly available databases, 
such as BioGRID and DIP, as well as the datasets of Stelzl 
and Vidal. Associated sequence similarities were gener- 
ated by using BLAST on the sequences retrieved from 
Ensembl. We use a = 0.80 for fixed number of iterations 
(20) in all experiments. 

Computational performance 

Computation times for mat3_* and iso_* implementations 
are plotted in Figure 1. 
Two important observations can be made: 

• Our mat3_* matches are generated approximately 30 
times faster than iso_*, for almost all pairs of 
networks. For the largest datasets this roughly 
translates to less than 5 minutes in our case 
(compared to iso_*'s 2 hours of processing time). 

• The adaptive auction algorithm is a key element in 
this performance improvement, as opposed to the 
Hungarian algorithm — assuming IsoRank 
(constructing X) and greedy matching 
implementations are of comparable performance. 
Please note that we have no way of benchmarking 
these separately, since these times are not reported by 
the reference implementation of Singh et al. One may 
argue that our implementations of similarity matrix 
computations are much faster than corresponding 
implementations of Singh et al. While this is unlikely, 
even if this were to be the case, our significant 
performance improvements over the most widely 
distributed implementation of IsoRank represent the 
core of our contributions. 

We also note that, especially for the largest pair of 
networks, our adaptive auction algorithm is the most effi- 
cient method for extracting matching pairs from similarity 



Table 1 PPI network data 



Species 


Dataset name 


\v\ 


\E\ 


fruitfly(D. melanogaster) 


dmela 


7518 


25830 


bacterium(E coli) 


ecoli 


1821 


6849 


human(H. sapiens) 


hsapi 


9633 


36386 


yeast(S. cerevisioe) 


scere 


5499 


31898 



Species for which PPI network data is used in our experiments. 



score matrices. Specifically, the only timing information 
that is available after running the native binary from [11] 
is the overall running time, which includes three sepa- 
rate components - (i) their similarity matrix construction 
phase, (ii) their Hungarian algorithm run, and (iii) their 
greedy approach. These three times cannot be separated. 
However we can measure, separately, the times for (i) 
our similarity matrix construction phase, (ii) our auction 
matching run, and (iii) our greedy approach run. Only the 
implementation of the second part, i.e, their Hungarian 
versus our auction algorithm, are fundamentally different. 
The corresponding similarity computations in the first 
phase and the greedy matching algorithm rely on simi- 
lar algorithms. This suggests that the performance gain 
that we observe in the overall running time from the three 
parts in both cases (roughly x30 speedup) can be primar- 
ily attributed to the performance difference between their 
Hungarian and our auction algorithm implementations. 

In terms of the number of operations, the complexity 
of the Hungarian algorithm is 0(\V\(\E\ + |V|log|V|)). 
This is in comparison to the worst case complex- 
ity of an £-scaling auction algorithm, which is given 
by 0(|^||£|log(|^|C)) for integer weights (with C = 
max^y \xij\ and Xy the similarity scores). Here \V\ is the 
maximum number of vertices between the two graphs to 
be aligned and \E\ is the number of entries in X. Exper- 
imental results suggest that the worst case runtime for 
the auction algorithm is rare. Other causes for the speed 
improvements include suboptimal data structures in sim- 
ilarity matrix computations in the binary code, or an 
unoptimized Hungarian implementation. These are hard 
to decipher from the binary - in either case, our software 
yields significant overall acceleration of the state-of-the- 
art approach from Singh et al. for the global alignment 
problem. 

The alignment graph and its assessment 
Topological perspective 

In Figure 2, the number of conserved edges in the align- 
ment graph is reported. Our proposed mat3 .auction 
approach outperforms the other methods in at least 2 out 
of the 6 cases - more conserved edges imply better align- 
ments, as described in the Methods Section - and that 
the mat3_* matchings are superior to the iso_* ones (in 5 
out of 6 cases, and on average). However there is no clear 
"winner", i.e. a single method that is the best for all test 
cases. 

Biological perspective 

To assess the functional coherence of computed align- 
ments, we report on their sensitivity and specificity 
(please see Methods Section for more details). Table 2 
summarizes the corresponding statistics. We then pro- 
vide a detailed analysis for a subset of the top-ranked 
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■ similarity 



■ greedy auction speed iso/mat3 




20 E 



dmela-ecoli dmela-hsapi dmela-scere ecoli-hsapi ecoli-scere hsapi-scere 
Selected species pairs 

Figure 1 Timing results. Timing results for extracting all four types of matches of Figure 4. Times for obtaining mat3_* matches are represented as 
bars with colors identifying the relative contribution of each of the three algorithmic blocks (mat3, greedy, auction). The black line gives the 
corresponding speedup (as read on the right vertical axis) in generating our mat3_* matches relative to iso_* ones. 



components with co-enriched terms, where a functional 
term is enriched in the respective subgraphs of both input 
networks (from different species). 

Sensitivity 

In terms of overall sensitivity (average of true-positive rate 
(TPR) in the pair of aligned species), mat3 .auction shows 
superior performance, except in aligning human-versus- 
fly, where mat3_g reedy outperforms the mat3_auction 
method. It is notable here that sensitivity is not com- 
parable across different tables, since the total number 
of expected enriched terms is a function of evolution- 
ary distance among species pairs that are being aligned. 
Species that have diverged more recently are more prob- 
able to have common pathways, which results in higher 
number of identified terms. However, for a fixed pair 
of species, which defines a unique functional space, we 



can use sensitivity as a measure to compare different 
methods. 

Specificity 

In terms of average specificity, we observe more diver- 
sity among different methods. Surprisingly, mat3_g reedy 
is the top-ranked method in 4 out of 6 experiments, 
except in aligning yeast- versus-bacterium and bacterium- 
versus-human, for which iso_hungarian and mat3_auction 
perform better, respectively. These results suggest that the 
well-known Hungarian algorithm for maximum weighted 
bipartite matching does not necessarily enhance the bio- 
logical quality of the results. One possible explanation for 
this phenomena is the over-fitting problem. The objec- 
tive function for the matching phase is denned over the 
set of pairwise similarity scores for the nodes in different 
graphs, which itself is computed using both sequence and 



■ mat3_greedy mat3_auction ■ iso_hungarian ■ iso_greedy 



hsapi-scere 
ecoli-scere 



ecoli-hsapi 
dmela-scere 



dmela-ecoli ^^^^^^^^^ 

0 200 400 600 800 1000 1200 1400 1600 1800 2000 

Number of conserved edges 

Figure 2 Number of conserved edges in the alignment graphs. Number of conserved edges in the alignment graphs for all combinations of 
species pairs and computation methods. 
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Table 2 Biological validation of alignment graphs 



(a) D.melanogaster vs S.cerevisiae 






Fly 




Yeast 




Method 


TPR 


TNR 


TPR 




TNR 


iso_greedy 


535 


1 7.00% 


324 




43.00% 


iso_hungarian 


457 


15.20% 


1066 




44.00% 


mat3 .auction 


490 


15.40% 


1132 




38.30% 


mat3_g reedy 


448 


1 7.00% 


1102 




44.00% 


(b) D.melanogaster vs H.sapiens 






Fly 




Human 




Method 


TPR 


TNR 


TPR 




TNR 


iso_greedy 


1067 


30.10% 


191 




9.00% 


iso_hungarian 


974 


31.50% 


671 




14.30% 


mat3 .auction 


1020 


27.00% 


519 




12.50% 


mat3_g reedy 


1029 


30.80% 


670 




1 7.00% 


(c) D.melanogaster vs E.coli 






Fly 




Bacterium 




Method 


TPR 


TNR 


TPR 




TNR 


iso_greedy 


41 


1 1 .63% 


66 




16.60% 


iso_hungarian 


60 


1 0.38% 


175 




29.35% 


mat3 .auction 


90 


9.80% 


236 




32.00% 


mat3_greedy 


56 


1 0.73% 


235 




32.58% 


(d) H.sapiens vs S.cerevisiae 






Human 




Yeast 




Method 


TPR 


TNR 


TPR 




TNR 


iso_greedy 


971 


31.74% 


406 




1 7.97% 


iso_hungarian 


1010 


31.27% 


1063 




41.76% 


mat3 .auction 


1064 


28.40% 


1083 




40.16% 


mat3_greedy 


989 


33.33% 


1095 




46.14% 


(e) S.cerevisiae vs E.coli 






Yeast 




Bacterium 




Method 


TPR 


TNR 


TPR 




TNR 


iso_greedy 


419 


46.10% 


63 




15.00% 


iso_hungarian 


359 


46.60% 


231 




38.24% 


mat3 .auction 


400 


40.72% 


274 




35.75% 


mat3_greedy 


378 


44.22% 


289 




35.37% 


(f) E.coli vs H.sapiens 






Bacterium 




Human 




Method 


TPR 


TNR 


TPR 




TNR 


iso_greedy 


52 


16.77% 


472 




29.56% 


iso_hungarian 


158 


35.63% 


386 




28.40% 


mat3 .auction 


252 


32.96% 


557 




34.80% 


mat3_greedy 


219 


36.00% 


444 




29.73% 



Biological validation of alignment graphs based on the method of Kalaev et al. [21 ]. The TPR column contains the total number of GO terms covered in the biological 
process branch, and the TNR column represents the percent of components that were enriched in each species. 
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topological similarities. We note that both of the ini- 
tial scores - sequence similarity scores computed using 
BLAST, and aggregated scores computed using IsoRank, 
are inherently noisy and over-fitting a model (matching) 
on them can potentially decrease the performance of the 
results. 

Examples of highly enriched components 

We also evaluate the performance of the mat3_auction 
method by extracting components that are highly 
enriched in both species with respect to a unique GO 
term. We manually curated the components identified 
by mat3 .auction and selected four significant compo- 
nents for our case study, spanning four different species 
pairs. These components, which are shown in Figure 3, 
reveal that there is a strong correlation between structural 
conservation and functional similarity, which has been 
faithfully recovered by mat3 .auction. Most of these com- 
ponents have more than one co-enriched GO term, but 
interestingly enough, these terms are coherent in the sense 
that they describe the same function in more or less detail. 
For example, component 3(b) is annotated with RNA pro- 
cessing, which is a generalization of the term nuclear 
mRNA splicing via spliceosome, another enriched term in 
3(b). Similarly, component 3(c) is enriched with both his- 
tone acetylation and histone modification. The first term is 
clearly a refinement of the type of modification. 



These functionally similar modules exhibit closely 
related structure. In most cases, the smaller subgraph 
and all its edges can be perfectly embedded in the other 
species. The set of missed interactions can be explained by 
either evolutionary divergence or the quality of the input 
dataset. 

From an evolutionary standpoint, one can argue that 
the set of orthologous genes in the pair of species could 
have diverged and either gained or lost specific functions. 
These functional adaptations are reflected in the pattern 
of protein-protein interactions by deleting or inserting 
partnerships. On the other hand, an easier argument can 
be made based on the quality and the coverage of pro- 
teome, noting that different PPI interactions have been 
predicted in different labs for each species. Some of 
these species are well-studied and there are more high- 
quality interactions available for them. As is evident from 
Figures 3(a) and 3(d), most of the missed edges reside in 
yeast, which can simply be explained by heterogeneity in 
the quality of initial PPI networks. 

Conclusions 

Our results show that the IsoRank-based method for 
computing similarity scores between nodes of two PPI 
networks, coupled with a fast, adaptive, auction-based 
implementation and a matrix-based greedy algorithm for 




D.melanogaster 

Figure 3 Conserved components identified by the mat3_auction method. Conserved components identified by the mat3_auction method. 
Each component is coherently annotated with specific branch of the biological process, (a) Peroxisome organization, (b) RNA splicing, (c) Histone 
modification, (d) Ribosome biogenesis. 
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extracting matching proteins, yields an efficient and effec- 
tive algorithm for global alignment. The method yields 
over one order of magnitude improvement in compu- 
tation time - typically x30 times faster for the total 
alignment process in most cases of interest - with com- 
parable or superior topological and biological quality of 
the results. Using out method, alignments can be com- 
puted within minutes (less than 5 minutes for typical 
alignments), as compared to hours, using previously used 
methods (of the order of 130 minutes for the largest input 
configurations). This enables users to tune alignment 
parameters much better and extract superior alignments. 

Methods 

Ranking node pairs in IsoRank 

We use the similarity matrix construction step in IsoRank 
in all alignment methods. This method is based on an 
analogy between the network alignment problem and that 
of identifying "reputed" nodes in a single network - also 
sometimes called the page- or node-ranking problem. Per- 
haps, the most commonly used measure for the rank of a 
node in a single given network can be recursively denned 
as follows: "a node is important if it is linked to other 
important nodes" [12]. Extending this definition to our 
node similarity problem, we arrive at the following def- 
inition: "two nodes are (topologically) similar if they are 
linked to other (topologically) similar node pairs" [6,7,13]. 
IsoRank effectively implements this notion of "recursive 
node similarity". Note that this perspective is not new - 
it has been exploited in application areas like automated 
image captioning [14] and synonym extraction [15]. 

We initiate a more formal discussion by introducing 
necessary notation. Given a graph Ga = (Va>Ea)> Va and 
Ea denote the vertices and edges of Ga respectively, and 
Ha = |VX|. Its adjacency matrix A has elements ay = 1 
iff edge (/,/) e E, and ay = 0 otherwise. Clearly, A 
is symmetric for an undirected graph. Matrix A is the 
normalized version of the matrix A T ; formally, (A)# = 
dji/ YTit\ a ji f° r nonzero rows of A and zero otherwise. 
We denote by 1, the column vector of size via consisting of 
Is. Also, the vec(-) operator for stacking matrix columns 
into a vector (as well as its associated "inverse" unvec(-) 
operator for re-assembling the matrix) are used. Using 
these operators, vec(AXB) = (B T ® A)x, holds, where ® 
denotes the Kronecker product of two matrices. 

In IsoRank, vertex similarity scores in (PPI) networks 
are computed by integrating both vertex attributes (simi- 
larity of protein sequences) and topological affinity (links 
to similar nodes). More specifically, in [6], Singh et al. 
introduce the following iterative procedure for computing 
similarity scores: 

x <r- aA <g> Bx + (1 - a)h. (1) 



Here x = veciX) is the vector vertically stacking the 
columns of the similarity matrix X having as entries the 
similarity scores xy. h = veciH), similarly stacks ele- 
ments hy of matrix H, i.e., the elemental similarity scores 
between nodes i e Vb and j e Va< Vectors x and h are 
normalized to unity. Successive iterates scale topological 
similarity and elemental similarity of nodes by factors a < 
1 and 1 — a, respectively. In the context of PPI networks, 
vector h encodes protein sequence similarity scores in par- 
ticular, and protein interaction networks Ga and Gb are 
assumed to be undirected. By "unvec"ing (1), we obtain: 

X +- aBXA T + (1 - a)H. (2) 
Matching algorithms 

As a first step in identifying similar subgraphs in the 
two networks, IsoRank computes a |Va|-dv-|Vb| similar- 
ity score matrix X. We assume without loss of generality 
that \Va\ < \Vb\- The matrix can be viewed as encod- 
ing a weighted bipartite graph G = (Va, Vb>E, w), where 
Va H Vb = 0, E c Va x V#, and the weight function 
w : E —> R-°. Each row represents a vertex in Va and each 
column a vertex in Vg. A non-zero entry in the matrix is 
interpreted as an edge between the row and column ver- 
tices. The numerical value xy of the matrix indicates the 
weight of the edge. A matching in the bipartite graph is 
defined as a subset M c E such that no pair of edges of 
M are incident on the same vertex. In a maximal match- 
ing, no edge can be added to M without violating the 
matching property. A maximum (cardinality) matching is 
a matching that contains the largest possible number of 
edges. Specifically, we are interested in finding a maxi- 
mum matching with maximum weight. The weight of the 
matching is denned as J2(i,j)eM x ij> tn * s typically trans- 
lates to a large number of matching pairs with a high 
cumulative similarity score. 

Various algorithms have been proposed for computing 
a weighted matching M for a given similarity matrix X. 
Approximation algorithms [16], which compute a max- 
imal matching with a maximum weight, and maximum 
weighted matching algorithms [17,18], which return a 
maximum matching with a maximum weight, are all can- 
didates for finding a suitable assignment. In our approach, 
we use the 1/2-approximation algorithm and propose a 
weighted matching implementation that is based on the 
principle of auctions. The main advantage of using an 
auction-based scheme is the so-called ^-scaling mech- 
anism, using which the quality and convergence of the 
algorithm can be controlled. 

1/2-approximation algorithm 

This simple approximation algorithm can be described 
as follows: First, the weights of the edges are sorted in 
decreasing order. Then, the heaviest edge e is selected and 
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deleted, together with the edges incident to its endpoints. 
This is repeated until the graph is empty. The worst case 
complexity of this sorting procedure is (D(\E\ log \E\). We 
implement the linear- time 1/2-approximation algorithm 
of Preis [16], of known time complexity (D(\E\) (Algo- 
rithm 1), translating this graph-based description into 
matrix operations (matrix-based greedy algorithm): After 
selecting the element with maximum value xy from the 
similarity matrix I ^ 0, we report the matching of nodes 
i and /. We then zero the i th row and the j th column of X, 
and repeat the aforementioned step until X contains only 
zeros across one of its dimensions. 

Algorithm 1 1/2-approximation algorithm for weighted 
matching 

1: M = 0; 

2: while£^0do 

3: Take an {a, b) e E with locally heaviest weight; 
4: M = M U {a, b}; 

5: Remove all edges incident to a or b from E; 
6: end while 



Auction algorithms 

Auction algorithms [18] find the maximum weighted 
matching via an auction: in this analogy, i e Va is a 
person, j e Vb is an object and xy is the benefit the 
buyer i obtains by acquiring object Each object j has an 
associated price pj, with the initial value zero. 



subtracting the second-best profit v; from the most valu- 
able profit Ui, i.e., U{ — v/. The most valuable profit U{ 
for buyer i is defined as {x^ — pj.}, while the second-best 
profit Vi is computed by max ; y ; -.{^y — pj). After the unas- 
signed buyer has submitted the bid, the designated object 
is awarded to the bidder, yielding its potential previous 
owner unassigned. The price is calculated by updating 
the old price by the corresponding bid and by a small 
increment s. It follows that the auction-based algorithm 
(see Algorithm 2 for a simplified description also assum- 
ing integer Xij) consists of four phases: the initialization 
phase (lines 1-7), the bidding phase (lines 9-12), the 
assignment phase (lines 13-15), and the termination phase 
(line 8). 

The initial value of increment s has significant impact 
on the computational cost of the auction algorithm. Ide- 
ally, the value of s should be close to the optimal value 
of the price, since the number of iterations to find a 
matching will be small. In general, the computational 
worst case complexity of an ^-scaling auction algorithm is 
G(\V\\E\ log(| V|Q) (where C is denned in line 5 of Algo- 
rithm 2). Due to the pseudo-polynomial complexity, we 
embed an aggressive ^-scaling strategy into the auction- 
based implementation of Algorithm 2, resulting in our 
adaptive auction algorithm, which effectively splits the e- 
scaling phase into multiple ^-scaling phases; its steps are 
detailed in Algorithm 3. 



Algorithm 2 Auction Algorithm 



Algorithm 3 Adaptive Auction Algorithm 



M^0; 

/ +- [i : 1 < i < n A ); 
Pj <- 0 for j = 1, . . . , ns; 
t ±- 0; 

C <- max,-/ \xij\; 



> current matching 
> set of unassigned buyers 
> initialize prices to 0 
> iteration counter 
> maximum value in G 



0 <r- 16; f 

b ^ e> 

while / 7^ 0 do 

;/ <r- argmax ; Ky - pj); 

U{ X{j i — Pj.', 

vt +- maxjfiifaj -pj); 
PJi ^Pji + Ui-Vi + e; 
M ^MU {i,ji); I^I\{i); 
M ^ M \ {k,ji); I <- IU {!<}; 
t<^t+l; 

8 ±- max{^,e- f}; 
end while 



> initialize s with a large value 

> auction iteration 

> find best objects 



> update price 
> update M, I 



In an auction iteration, the bidding and assignment 
phase, and update of the price and of the value of s are 
performed. In the bidding phase, an unassigned buyer 
i bids for the best object i.e., the object that has 
the maximal profit for buyer /. The bid is computed by 



9: 
10: 
11: 



Perform initialization phase of algorithm 2 (lines 
1-4); 

§ +- 4; 0 +- 16; 

8 <- I min | f , f | I ; > Initialize threshold 8 

while |/| > Odo 

while^f/f > 8 do 

Perform bidding and assignment phase of 

algorithm 2 (lines 9-15); 
£ <- s • §; 
end while 

8 <r- p , 

end while 



Here, the £ -value is initialized with a small value and 
adaptively increased relative to the overall progress of the 
matching. The basic idea behind the proposed heuristic is 
that in the inner iteration, at least 8 buyers get assigned 
to an object, while s pushes the price of the object to 
a large value. In general, the heuristic provides a maxi- 
mal matching with a maximum weight. Since the matrix 
X is dense, a simple greedy approach is applied subse- 
quently, in order to achieve a maximum matching. The 
post-processing method assigns an unmatched buyer to 
the first unmatched object from the object list, resulting 
in a maximum matching with maximum weight. 
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For a more detailed presentation of the auction algo- 
rithm we use, we refer readers to Sathe et al. [19], where a 
scalable distributed version of the algorithm is described. 
This formulation computes weighted matchings on sparse 
and dense bipartite graphs running on hundreds of com- 
pute nodes, while efficiently using multiple cores on each 
compute node. 

Numerical experiments 

In all cases, the construction of the similarity matrix X 
is IsoRank-based. We implemented the iterative scheme 
of Equation (2) in Matlab (mat3_* cases — name mat3 is 
used to indicate a triple-matrix product kernel) and tested 
this part against equivalent codes in the netalign package 
[20], yielding exact agreement (within machine accuracy). 
The resulting matrix X of similarity scores is then input 
to (i) our matrix-based implementation (in Matlab) of 
the 1/2-approximation algorithm (hereafter referred to as 
greedy) to produce the mat3 .greedy alignment and (ii) 
to the adaptive auction algorithm (in C) to generate the 
respective mat3 .auction alignment. 

We then execute, for the same runtime parameters, the 
reference IsoRank binary for producing iso_g reedy and 
iso_hungarian alignments and compare them against our 
approach (mat3_* alignments). Unfortunately the refer- 
ence native code does not provide an option for generating 
either the similarity matrix X, or the timings for its com- 
putation. It internally uses the result of this first phase to 
extract the best matching node pairs, the second phase. 
The total timing results — for both phases — are reported 
together with the extracted matchings pairs in this case. 
Specifically, results from two matching algorithms are 
reported [6], namely their implementation of the greedy 
and Hungarian algorithms - thus justifying the selection 
of iso.greedy and iso_hungarian names for the complete 
alignment pipelines. It is understood that marked devi- 
ations could arise due to internal optimization-targeted 
realizations of the algorithmic ideas, however these are 
not readily accessible or reported. Figure 4 summarizes 
the algorithmic blocks driving the generations of four sets 
of matchings for each pair of PPI networks. 

The alignment graph and its assessment 

Given a set of matching node pairs (one of the four mat3_*, 
iso_* possible alternatives in this context), we need to eval- 
uate solution quality. The alignment graph is the auxiliary 
structure built to facilitate this task. Each node in the 
alignment graph is a matching node pair from the com- 
puted set of matches. Furthermore, if m\ = (h,ji) and 
^2 = (h>h) m Vk x Vb are two such matches, then m\ 
and m<i are connected by an edge iff i\ is connected with 
*2 and j\ with j2 in the input graphs Ga and G#. Essen- 
tially, the alignment graph identifies the link structures 
that remain invariant when replacing nodes of one graph 



greedy 



networks 




elemental similarities 
as matrix 



networks 



IsoRank 



T 



mat3_greedy 



matches 



mat3 auction 



iso_greedy 

matches 
iso_hungarian 



elemental similarities 
as matrix 

Figure 4 Algorithmic blocks Algorithmic blocks for the 
generation of the four types of matches (mat3 .greedy, 
mat3 .auction, iso_greedy, iso.hungarian) under comparison. 



with their matching counterparts in the other; it captures 
how our computed matchings between nodes preserves 
implicit/induced matchings between their incident edges. 

Topological perspective 

When analyzing the alignment graph of two networks, 
two measures for the topological evaluation of the com- 
puted match can be used: 

• The number of edges in the alignment graph 
(conserved edges). Note that the more the conserved 
edges - incident on our matching nodes in the two 
networks - the larger the percentage of "link 
structure" (edges at minimal or larger connected 
subgraphs) that could be put under direct mapping as 
well, i.e. aligned. 

• The size of the connected components in the 
alignment graph (common connected subgraphs). 
These are "clusters" of pairs of matching nodes in the 
original graphs also conserving their link patterns. 

The existence of many conserved edges clearly increases 
the probability of them being part of extensive connected 
subgraphs. However, it could also be the case that they 
are parts of larger numbers of connected subgraphs, all of 
moderate sizes. Comparative, connected subgraph statis- 
tics is the main focus of quality assessment from the 
biological point of view. 

Biological perspective 

Topological assessment of the alignment graph can 
uncover important characteristics of each alignment 
method. However, additional considerations must be 
taken into account to avoid growing components at the 
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expense of over-generalizing them. This can negatively 
affect the specificity of predictions. For example, by ana- 
lyzing the size of connected components, one can penalize 
against fragmenting functional modules. An alignment 
with many singletons (isolated pairs of nodes in the align- 
ment graph) is less desirable than a larger connected 
component that can group a number of related nodes 
(and their corresponding proteins) together. On the other 
hand, by mixing functionally independent groups that 
share a small subset of genes, one can create larger com- 
ponents that are not functionally coherent. To remedy 
this problem, we adopt an approach similar to the one 
proposed by Kalaev et al. [21] to assess the functional 
coherence of each connected component in the alignment 
graph. In this approach, each connected component is 
treated as a computational prediction of a functionally 
related group of proteins and is cross-validated against 
the existing GO annotations as the actual set of func- 
tionally related genes. Given the gene ontology (GO) [22] 
annotations with respect to biological process (BP) for 
the member genes of each species, we compute the set 
of enriched GO terms within each connected component 
in the alignment graph using the GO::TermFinder tool 
[23]. We process each graph separately and, similar to 
Kalaev et al. [21], apply a threshold of 0.05 to extract the 
set of enriched GO terms. Finally, we define two com- 
plementary criteria to validate each alignment method - 
the fraction of enriched components (components with 
at least one enriched GO term) and the total number of 
covered GO terms in all components. The former cap- 
tures specificity (true negative rate or TNR), while the 
latter captures the concept of sensitivity (true positive rate 
orTPR). 

Source code and datasets 

All online material for this project is available at http:// 
compbio.soihub.org/projects/fastalign/. This includes 
Matlab scripts, C code for the (more general, distributed 
version of the) auction algorithm, input datasets, and also 
data generated during intermediate phases of an actual 
run. 
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